CMSC320 Final Project

Applying the Data Science Pipeline to FIRST Robotics Competition Data

By Joshua Kalampanayil

Introduction

FIRST (For Inspiration and Recognition of Science and Technology) is a nonprofit organization that combines robotics and competition for K-12 students. During my 4 years in high school, I had the opportunity to be part of a FIRST Robotics Competition (FRC) team; FRC is FIRST's program tier designed for high school students.

The structure of a normal FRC season is fairly straightforward: every January, a new game is released. Teams have until the end of February to build a robot (typically weighing around 125 lbs) to accomplish tasks within the game. From the end of February until April, teams compete against other teams at district or regional events and try to win them. In a district system, teams seek to win events to earn district points, which determine whether they can first attend their district championship and, from there, the FRC Championship in May. In a regional system, teams seek to win the regional event itself, as a win is a direct ticket to the FRC Championship.

At each event, teams compete against each other during qualifications in two alliances of 3 teams (namely, the Blue Alliance and the Red Alliance), and can earn up to 4 ranking points in a match: 2 for winning, and 2 for completing game-specific tasks that vary from season to season. After qualifications, the top 8 ranked teams each select 2 other teams to join their final alliance, and the 8 alliances battle it out in quarterfinals, semifinals, and finals matches until only 1 alliance is left standing. While final alliances initially consist of 3 teams, an alliance can grow to 4 teams if one team's robot breaks down and a backup team is substituted in. Thus, there can be up to 4 winning teams at each event.

In this tutorial, I will use FRC data to walk through the data science pipeline, and use machine learning to predict winners of an event from qualification data.

Data Collection and Processing

For the data, I made two distinct choices: 1) I focused on data from the 2019 season. This was the last "normal" season: the 2020 season was cut short due to COVID-19, competitions didn't occur in 2021, and different districts took on different event models for the 2022 season. As such, the 2019 season was selected for consistency. 2) Given that I participated in FRC for 4 years, I decided to focus on my home district, FIRST Chesapeake (CHS). This district holds events across Maryland and Virginia, which can be seen in the event code of each event (besides the district championship event).

For the data, there were 2 sources I could retrieve data from: FRC Event Web, an API provided directly by FIRST, or the API from The Blue Alliance (TBA). TBA provides data directly from the FRC Event Web API, as well as some additional statistics. From a visual perspective, the TBA website is also easier to navigate, and it links each match directly to its match video on YouTube. From a data perspective, because of the added statistics, I decided to use the API from TBA.

In [44]:
import requests
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn
from scipy import stats
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import sklearn.model_selection as ms
import sklearn.metrics as met
In [2]:
# This is a token needed to read data from the API -- it only has read access
auth = "ruwUB8CNwFYGz3EuHThYzGcd39CRPHgnV8YgEAV0FWu0Ffj200iEXSvyfTLeETgy"
In [3]:
headers = {'X-TBA-Auth-Key' : auth}
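Every request in this tutorial follows the same pattern: a TBA API v3 URL plus the `X-TBA-Auth-Key` header. As a minimal sketch, that pattern could be wrapped in a small helper. The names `TBA_BASE_URL`, `tba_url`, and `tba_get` are hypothetical and are not used by the cells in this tutorial; the timeout and the `raise_for_status()` check are additions of this sketch.

```python
import requests

# Base URL taken from the requests made in this tutorial
TBA_BASE_URL = "https://www.thebluealliance.com/api/v3"


def tba_url(path):
    """Build a full TBA API v3 URL from a relative path such as 'event/2019mdbet/teams'."""
    return TBA_BASE_URL + "/" + path.strip("/")


def tba_get(path, auth_key, timeout=10):
    """Hypothetical helper: fetch one endpoint and return the decoded JSON."""
    response = requests.get(
        tba_url(path),
        headers={"X-TBA-Auth-Key": auth_key},
        timeout=timeout,
    )
    # Raise an exception on HTTP errors instead of parsing an error body as data
    response.raise_for_status()
    return response.json()
```

With such a helper, a call like `tba_get("event/2019mdbet/teams", auth)` would stand in for the `requests.get(...)` and `.json()` pair.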

First, I needed to get the different events that happened in CHS in 2019.

In [4]:
events_request = requests.get("https://www.thebluealliance.com/api/v3/district/2019chs/events/keys", headers = headers)
In [5]:
json = events_request.json()
In [6]:
district_events = pd.DataFrame.from_dict(json)
In [7]:
district_events
Out[7]:
0
0 2019chcmp
1 2019mdbet
2 2019mdowi
3 2019mdoxo
4 2019vabla
5 2019vagle
6 2019vahay
7 2019vapor
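As noted earlier, the event codes hint at where each event took place. The sketch below splits each key into its year and event code and groups the keys by the first two letters of the code. The function names are hypothetical, and reading the two-letter prefix as a state abbreviation is an assumption based on the keys listed above (the championship key starts with `ch` instead).

```python
def split_event_key(event_key):
    """Split a TBA event key such as '2019mdbet' into (year, event_code)."""
    return int(event_key[:4]), event_key[4:]


def region_prefix(event_key):
    """Return the first two letters of the event code (e.g. 'md' or 'va')."""
    return split_event_key(event_key)[1][:2]


# The event keys returned by the district call above
event_keys = ["2019chcmp", "2019mdbet", "2019mdowi", "2019mdoxo",
              "2019vabla", "2019vagle", "2019vahay", "2019vapor"]

# Group the keys by their two-letter prefix
by_region = {}
for key in event_keys:
    by_region.setdefault(region_prefix(key), []).append(key)
```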

From this list, we can see that there were 7 district events, along with the district championship. To demonstrate the curation process step by step, I will first walk through it for one district event, 2019mdbet. Afterward, I'll repeat the process for all the other events so that I can properly explore the data.

The one downside of the TBA API is that the information about events, teams, and scores is spread across various API calls. As such, I will go through each call and compile the information together in one DataFrame.

For the first event, I need to gather which teams were at the event, as well as some basic information about each team.

In [8]:
teams_request = requests.get("https://www.thebluealliance.com/api/v3/event/2019mdbet/teams", headers = headers)
json = teams_request.json()
teams_at_event = pd.DataFrame.from_dict(json)
teams_at_event
Out[8]:
address city country gmaps_place_id gmaps_url key lat lng location_name motto name nickname postal_code rookie_year school_name state_prov team_number website
0 None Edgewater USA None None frc1111 None None None None NASA/The Power Hawks Robotics Club, Inc./Anne ... Power Hawks Robotics 21037 2003 South River Senior High School Maryland 1111 http://www.powerhawks.org
1 None Fairfax USA None None frc1123 None None None None Sony/AIM ROVER/Redeeming Grace Church&Neighbor... AIM ⛟ Robotics 22030 2003 Neighborhood Group Virginia 1123 http://1123.team
2 None Bethesda USA None None frc1389 None None None None Leidos/Domaine an RLAH Group/The Travel Fairy/... The Body Electric 20817 2004 Walt Whitman High School Maryland 1389 http://www.team1389.org
3 None Washington USA None None frc1446 None None None None Friendship Public Charter School/Bechtel Corpo... Robo Knights 20019 2004 Friendship Pcs-Collegiate Acad District of Columbia 1446 http://www.firstinspires.org/
4 None Lutherville Timonium USA None None frc1727 None None None None Friends and Family of REX/Levis Family Foundat... REX 21093 2006 Dulaney High School Maryland 1727 https://www.dulaneyrobotics.org
5 None Haymarket USA None None frc1885 None None None None US STEM Foundation/Lockheed Martin/Macedon Tec... ILITE Robotics 20169 2006 Battlefield High School Virginia 1885 http://www.ilite.us
6 None Washington USA None None frc1915 None None None None NASA Headquarters/Bechtel/DC Public Schools/Go... MTHS Firebird Robotics 20002 2006 Mckinley Tech High School District of Columbia 1915 http://www.firstinspires.org/
7 None Chantilly USA None None frc2186 None None None None BAE Systems/CACI/ICF/The Monachello Family/The... Dogs of Steel 20151 2007 Westfield High School Virginia 2186 http://www.dogsofsteel.org
8 None Columbia USA None None frc2537 None None None None Maryland State Department of Education/Marylan... Space RAIDers 21044 2008 Atholton High School Maryland 2537 http://www.team2537.com
9 None Columbia USA None None frc2849 None None None None Maryland State Department of Education / Varin... Ursa Major 21046 2009 Hammond High School Maryland 2849 http://www.hammondursamajor.org/ursamajor2849/
10 None Washington USA None None frc2900 None None None None DCPS/Google/United Therapeutics&School Without... The Mighty Penguins 20037 2009 School Without Walls Shs District of Columbia 2900 http://www.firstinspires.org/
11 None Washington USA None None frc2912 None None None None Google / Bechtel / DCPS CTE & Phelps Ace High ... Panther Robotics 20002 2009 Phelps Ace High School District of Columbia 2912 http://www.firstinspires.org/
12 None Washington USA None None frc2914 None None None None Bechtel/Amazon/NASA&Woodrow Wilson Senior High... TIGER PRIDE 20016 2009 Woodrow Wilson Senior High Sch District of Columbia 2914 https://www.wilsonrobotics.net
13 None Salisbury USA None None frc3389 None None None None Wicomico County Robotics Club / NASA- Wallops ... TEC Tigers 21804 2010 Parkside High School - CTE & Parkside High School Maryland 3389 https://wicomicocountyroboticsclub.weebly.com
14 None La Plata USA None None frc3650 None None None None Department of Defense, STEM / Navy Surface War... RoboRaptors 20646 2011 St Charles High School & North Pt Hs-Sci Tech ... Maryland 3650 http://www.firstinspires.org/
15 None Frederick USA None None frc3793 None None None None Bechtel/Lockheed Martin/Leidos/FCPS MD&Middlet... CyberTitans 21703 2011 Middletown High School & Tuscarora High School Maryland 3793 http://cybertitans3793.com
16 None Washington USA None None frc4456 None None None None Edison Electrical/Leidos&St John's College Hig... Mech Cadets 20015 2013 St John's College High School District of Columbia 4456 https://frc4456.com/
17 None Laurel USA None None frc4464 None None None None Chesapeake Lighthouse Foundation/MSBR/Abbott&C... Team Illusion 20707 2013 Chesapeake Math & It Pc-N-Ms Maryland 4464 http://www.teamillusion4464.com/
18 None Woodbridge USA None None frc4472 None None None None Lockheed Martin/Micron Technology/Raytheon Tec... SuperNOVA 22192 2013 Family/Community Virginia 4472 http://4472supernova.org/
19 None Silver Spring USA None None frc449 None None None None Intelligent Automation Inc./MBHS Magnet Founda... The Blair Robot Project 20901 2000 Montgomery Blair High School Maryland 449 https://robot.mbhs.edu/
20 None Huntingtown USA None None frc4514 None None None None DoDSTEM/Booz-Allen-Hamilton/Calvert Help Assoc... Calvert STEAM Works 20639 2013 Northern High School & Huntingtown High School Maryland 4514 https://sites.google.com/view/steamworks4514
21 None Washington USA None None frc4821 None None None None United Therapeutics/FIRST Chesapeake/ Capital ... cyberUs 20011 2013 District of Columbia International School District of Columbia 4821 http://www.cyberus4821.weebly.com
22 None Riverdale USA None None frc4949 None None None None City of College Park, MD / DoDSTEM / Leidos / ... Robo Panthers 20737 2014 Parkdale High School Maryland 4949 http:///www.phsrobopanthers.org
23 None Silver Spring USA None None frc5115 None None None None Montgomery County Public Schools/GEICO Informa... Knight Riders 20906 2014 Wheaton Senior High School Maryland 5115 https://wheatonrobotics.org/
24 None Clifton USA None None frc5243 None None None None Leidos/IBM/George Mason University/TapHere! Te... Aegis Robotics 20124 2014 Centreville High School Virginia 5243 http://www.centrevillerobotics.net/
25 None Alexandria USA None None frc5587 None None None None DoD STEM/Google/Boeing/Raytheon/Comcast/Intuit... Titan Robotics 22302 2015 Alexandria City High School Virginia 5587 https://frc5587.org
26 None Gambrills USA None None frc5830 None None None None Trek Networks/Maryland Space Business Roundtab... LIFE Engineering 21054 2016 Home School Maryland 5830 http://www.team5830.org
27 None Frederick USA None None frc5841 None None None None Maryland State Department of Education/Lockhee... The Patriots 21701 2016 Gov Thomas Johnson High School Maryland 5841 http://www.firstinspires.org/
28 None Fulton USA None None frc5945 None None None None Greenebaum Enterprises/Eaton/W R Grace/Hackgro... |CTRL| (Absolute Control) 20759 2016 Family/Community Maryland 5945 http://frc.thehackground.org
29 None Bowie USA None None frc6213 None None None None Patient First/uBreakiFix/SnapMobile/Argosy/Lei... Team Quantum 20715 2016 Bowie High School Maryland 6213 http:///www.teamquantum#6213.org
30 None Capitol Heights USA None None frc6239 None None None None Maryland Space Business Roundtable/Omnyon/Cere... The Irrational Engineers 20715 2016 Family/Community Maryland 6239 http://www.theirrationalengineers.com/
31 None Baltimore USA None None frc6326 None None None None Galen Robotics/The Abell Foundation&Northeast ... ⚡ Baltimore Bolts ⚡ 21230 2017 Baltimore City College HS 480 & Northeast Seni... Maryland 6326 http://www.baltimorebolts.com
32 None Thurmont USA None None frc686 None None None None Maryland State Department of Education/Bechtel... Bovine Intervention 21788 2001 Catoctin High School & Linganore High School &... Maryland 686 https://sites.google.com/view/firstteam686/abo...
33 None Greenbelt USA None None frc6893 None None None None MASER/Washington Academy of Sciences/Sigma Xi/... Bladerunners 20710 2018 Family/Community Maryland 6893 http:///www.maserdc.org
34 None Washington USA None None frc7714 None None None None Cardozo Education Campus RedRoad 20009 2019 Cardozo Education Campus District of Columbia 7714 None
35 None Bel Air USA None None frc7770 None None None None Family/Community Infinite Voltage 21014 2019 Family/Community Maryland 7770 None

The DataFrame above contains an abundance of information, most of which I consider unnecessary for this analysis. The exceptions are the key, which lets me identify the team, and the rookie year of the team, since there might be a relationship between team age and scores/winning. I decided to keep these two columns and drop everything else.

In [9]:
teams_at_event = teams_at_event[['key', 'rookie_year']]
In [10]:
teams_at_event
Out[10]:
key rookie_year
0 frc1111 2003
1 frc1123 2003
2 frc1389 2004
3 frc1446 2004
4 frc1727 2006
5 frc1885 2006
6 frc1915 2006
7 frc2186 2007
8 frc2537 2008
9 frc2849 2009
10 frc2900 2009
11 frc2912 2009
12 frc2914 2009
13 frc3389 2010
14 frc3650 2011
15 frc3793 2011
16 frc4456 2013
17 frc4464 2013
18 frc4472 2013
19 frc449 2000
20 frc4514 2013
21 frc4821 2013
22 frc4949 2014
23 frc5115 2014
24 frc5243 2014
25 frc5587 2015
26 frc5830 2016
27 frc5841 2016
28 frc5945 2016
29 frc6213 2016
30 frc6239 2016
31 frc6326 2017
32 frc686 2001
33 frc6893 2018
34 frc7714 2019
35 frc7770 2019

Now that I have the teams at the event, I want to collect data about each match. This involves another API call.

In [11]:
final = []
In [12]:
match_request = requests.get("https://www.thebluealliance.com/api/v3/event/2019mdbet/matches", headers = headers)
json = match_request.json()
matches = pd.DataFrame.from_dict(json)
matches
Out[12]:
actual_time alliances comp_level event_key key match_number post_result_time predicted_time score_breakdown set_number time videos winning_alliance
0 1552248703 {'blue': {'dq_team_keys': [], 'score': 80, 'su... f 2019mdbet 2019mdbet_f1m1 1 1552248893 1552248722 {'blue': {'adjustPoints': 0, 'autoPoints': 12,... 1 1552248360 [{'key': 'jpggyoCj7fc', 'type': 'youtube'}] blue
1 1552249441 {'blue': {'dq_team_keys': [], 'score': 76, 'su... f 2019mdbet 2019mdbet_f1m2 2 1552249626 1552249562 {'blue': {'adjustPoints': 0, 'autoPoints': 15,... 1 1552248780 [{'key': 'Ai35S_iX0wI', 'type': 'youtube'}] red
2 1552250204 {'blue': {'dq_team_keys': [], 'score': 74, 'su... f 2019mdbet 2019mdbet_f1m3 3 1552250460 1552250283 {'blue': {'adjustPoints': 0, 'autoPoints': 12,... 1 1552249200 [{'key': 'Q1EAiCSyLqc', 'type': 'youtube'}] blue
3 1552240516 {'blue': {'dq_team_keys': [], 'score': 24, 'su... qf 2019mdbet 2019mdbet_qf1m1 1 1552240692 1552240564 {'blue': {'adjustPoints': 0, 'autoPoints': 12,... 1 1552240800 [{'key': 'F189-5URbTU', 'type': 'youtube'}] red
4 1552242896 {'blue': {'dq_team_keys': [], 'score': 41, 'su... qf 2019mdbet 2019mdbet_qf1m2 2 1552243137 1552243024 {'blue': {'adjustPoints': 0, 'autoPoints': 12,... 1 1552242480 [{'key': 's2-2HVQeIdE', 'type': 'youtube'}] red
... ... ... ... ... ... ... ... ... ... ... ... ... ...
83 1552245064 {'blue': {'dq_team_keys': [], 'score': 64, 'su... sf 2019mdbet 2019mdbet_sf1m1 1 1552245241 1552245003 {'blue': {'adjustPoints': 0, 'autoPoints': 15,... 1 1552245840 [{'key': 'g1G_BORXPVA', 'type': 'youtube'}] blue
84 1552246734 {'blue': {'dq_team_keys': [], 'score': 44, 'su... sf 2019mdbet 2019mdbet_sf1m2 2 1552246917 1552246804 {'blue': {'adjustPoints': 0, 'autoPoints': 15,... 1 1552246680 [{'key': 'OeUp7Sv3ULA', 'type': 'youtube'}] red
85 1552247740 {'blue': {'dq_team_keys': [], 'score': 59, 'su... sf 2019mdbet 2019mdbet_sf1m3 3 1552247911 1552247706 {'blue': {'adjustPoints': 0, 'autoPoints': 9, ... 1 1552247520 [{'key': 'GhCt7EQadKQ', 'type': 'youtube'}] blue
86 1552245866 {'blue': {'dq_team_keys': [], 'score': 78, 'su... sf 2019mdbet 2019mdbet_sf2m1 1 1552246045 1552245842 {'blue': {'adjustPoints': 0, 'autoPoints': 15,... 2 1552246260 [{'key': 'oZN0YtEyoVk', 'type': 'youtube'}] blue
87 1552247203 {'blue': {'dq_team_keys': [], 'score': 79, 'su... sf 2019mdbet 2019mdbet_sf2m2 2 1552247378 1552247183 {'blue': {'adjustPoints': 0, 'autoPoints': 12,... 2 1552247100 [{'key': 'AG4tsEDb6aU', 'type': 'youtube'}] blue

88 rows × 13 columns

From this printout and the API documentation, there are two important things to notice: 1) The information about match scores is nested in another structure under the "alliances" column, which will need to be unpacked further to access that data. 2) The match data above includes playoff matches as well (quarterfinals, semifinals, and finals), which are not useful for the scope I've chosen: analyzing just the qualification match data.

Based on these two observations, I first decided to remove all playoff data.

In [13]:
# Removing any playoff match data
qualifications = matches[~matches['key'].str.contains("_f|_qf|_sf")]
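The match table also has a `comp_level` column, so the same filter could be expressed on that column instead of on the key text. The sketch below compares the two approaches on a toy frame. The qualification keys and the `"qm"` level are assumptions of this sketch, since the printout above only shows playoff rows (`f`, `qf`, `sf`).

```python
import pandas as pd

# Toy stand-in for the `matches` DataFrame; only the two relevant columns are included.
# The "qm" keys and level are assumed values for qualification matches.
toy_matches = pd.DataFrame({
    "key": ["2019mdbet_f1m1", "2019mdbet_qf1m1", "2019mdbet_qm1",
            "2019mdbet_qm2", "2019mdbet_sf1m1"],
    "comp_level": ["f", "qf", "qm", "qm", "sf"],
})

# Filter used in the tutorial: drop any key that looks like a playoff match
by_key = toy_matches[~toy_matches["key"].str.contains("_f|_qf|_sf")]

# Alternative: keep only rows whose comp_level marks a qualification match
by_level = toy_matches[toy_matches["comp_level"] == "qm"]
```

On this toy data both filters keep the same two rows. Filtering on `comp_level` avoids relying on how the match keys happen to be spelled.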

After removing playoff data, I began to dissect the scores.

In [14]:
# Grabbing the name of the event (for later use)
event_name = qualifications['event_key'].iloc[0]
qualification_scores = qualifications[['alliances']]
In [15]:
qualification_scores
Out[15]:
alliances
11 {'blue': {'dq_team_keys': [], 'score': 30, 'su...
12 {'blue': {'dq_team_keys': [], 'score': 23, 'su...
13 {'blue': {'dq_team_keys': [], 'score': 27, 'su...
14 {'blue': {'dq_team_keys': [], 'score': 18, 'su...
15 {'blue': {'dq_team_keys': [], 'score': 32, 'su...
... ...
78 {'blue': {'dq_team_keys': [], 'score': 64, 'su...
79 {'blue': {'dq_team_keys': [], 'score': 71, 'su...
80 {'blue': {'dq_team_keys': [], 'score': 44, 'su...
81 {'blue': {'dq_team_keys': [], 'score': 50, 'su...
82 {'blue': {'dq_team_keys': [], 'score': 11, 'su...

72 rows × 1 columns

Even though I removed playoff data, I still want to keep information about what teams won at the event, to identify any possible correlations between qualification performance and winning. This information was not provided in this API call, so I needed to make another API call that would contain this data (namely, in the Awards API call).

In [16]:
winner_request = requests.get("https://www.thebluealliance.com/api/v3/event/2019mdbet/awards", headers = headers)
json = winner_request.json()
awards = pd.DataFrame.from_dict(json)
winners = pd.DataFrame((awards[awards['name'] == "District Event Winner"])['recipient_list'].tolist())

# Since there can be between 3-4 winners per event, this code block breaks up the winners provided from the above line into individual columns, and then from there,
# I gathered all of the keys of the winning teams.
winner_list = []
# DataFrame.items() replaces iteritems(), which was removed in pandas 2.0
for index, column in winners.items():
    for awardee, team in enumerate(column):
        winner_list.append(team['team_key'])
In [17]:
winner_list
Out[17]:
['frc1885', 'frc449', 'frc2849']
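The same winner list can also be pulled straight from the decoded JSON without building intermediate DataFrames. The sketch below assumes the awards response is a list of dictionaries with `name` and `recipient_list` fields, which is how the code above accesses it. The second award entry and the function name `winning_team_keys` are hypothetical.

```python
# Trimmed stand-in for the awards JSON; only the fields used here are included
awards_json = [
    {"name": "District Event Winner",
     "recipient_list": [{"team_key": "frc1885"},
                        {"team_key": "frc449"},
                        {"team_key": "frc2849"}]},
    {"name": "Some Other Award",  # hypothetical non-winner award
     "recipient_list": [{"team_key": "frc1111"}]},
]


def winning_team_keys(awards, award_name="District Event Winner"):
    """Collect the team keys of every recipient of the given award."""
    return [
        recipient["team_key"]
        for award in awards
        if award["name"] == award_name
        for recipient in award["recipient_list"]
    ]


winner_keys = winning_team_keys(awards_json)
```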

After gathering winners, I then continued to dissect qualification match scores. The API call split the scores based on the two alliances, so I took a further step and created two DataFrames to handle the two alliances separately.

In [18]:
blue = pd.DataFrame((pd.DataFrame(qualification_scores['alliances'].tolist()))['blue'].tolist())
red = pd.DataFrame((pd.DataFrame(qualification_scores['alliances'].tolist()))['red'].tolist())
In [19]:
blue, red
Out[19]:
(   dq_team_keys  score surrogate_team_keys                    team_keys
 0            []     30                  []  [frc2186, frc2900, frc1727]
 1            []     23                  []  [frc1446, frc1885, frc2537]
 2            []     27                  []  [frc5830, frc6893, frc1123]
 3            []     18                  []  [frc6326, frc2914, frc4456]
 4            []     32                  []   [frc686, frc4514, frc5115]
 ..          ...    ...                 ...                          ...
 67           []     64                  []  [frc2900, frc3793, frc1885]
 68           []     71                  []  [frc4472, frc5587, frc6326]
 69           []     44                  []  [frc1123, frc1111, frc6213]
 70           []     50                  []  [frc4949, frc5115, frc1389]
 71           []     11                  []  [frc6213, frc7714, frc3389]
 
 [72 rows x 4 columns],
    dq_team_keys  score surrogate_team_keys                    team_keys
 0            []     47                  []   [frc5841, frc686, frc1885]
 1            []     62                  []   [frc3650, frc686, frc2912]
 2            []     26                  []  [frc7770, frc1915, frc1111]
 3            []     23                  []  [frc4464, frc2900, frc5587]
 4            []     21                  []   [frc5841, frc6213, frc449]
 ..          ...    ...                 ...                          ...
 67           []     38                  []  [frc5115, frc3650, frc2914]
 68           []     53                  []  [frc1727, frc5243, frc5841]
 69           []     36                  []  [frc4821, frc2186, frc1915]
 70           []     45                  []  [frc6239, frc5243, frc4472]
 71           []     32                  []   [frc449, frc2849, frc3793]
 
 [72 rows x 4 columns])
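As an alternative to building one DataFrame per alliance, `pd.json_normalize` can flatten the nested dictionaries in a single step. The sketch below uses the first two matches from the output above, omitting the empty `dq_team_keys` and `surrogate_team_keys` fields for brevity.

```python
import pandas as pd

# First two qualification matches from the output above (empty fields omitted)
alliances = [
    {"blue": {"score": 30, "team_keys": ["frc2186", "frc2900", "frc1727"]},
     "red": {"score": 47, "team_keys": ["frc5841", "frc686", "frc1885"]}},
    {"blue": {"score": 23, "team_keys": ["frc1446", "frc1885", "frc2537"]},
     "red": {"score": 62, "team_keys": ["frc3650", "frc686", "frc2912"]}},
]

# Nested dictionaries become dotted column names such as "blue.score";
# the lists of team keys are kept as list values
flat = pd.json_normalize(alliances)
```

This keeps each match on one row, which makes it easy to compare the two alliances' scores side by side.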

Now, putting it all together: I looped through every team and matched up their scores for each match they played, what alliance they were on for that match, and if they ultimately won the event.

In [20]:
final = []
In [21]:
for index1, row1 in teams_at_event.iterrows():
    team = row1['key']
    # Whether this team ultimately won the event (the same value for every match it played)
    won_event = team in winner_list
    # Seeing if the team was on the Blue Alliance for the match
    for index, row in blue.iterrows():
        if team in row['team_keys']:
            final.append([team, row1['rookie_year'], row['score'], event_name, "blue", won_event])
    # Seeing if the team was on the Red Alliance for the match
    for index, row in red.iterrows():
        if team in row['team_keys']:
            final.append([team, row1['rookie_year'], row['score'], event_name, "red", won_event])
In [22]:
all_matches = pd.DataFrame (final, columns = ['team_name', "rookie_year", "score", "event_code", "alliance_color", "won_event"])
In [23]:
all_matches
Out[23]:
team_name rookie_year score event_code alliance_color won_event
0 frc1111 2003 29 2019mdbet blue False
1 frc1111 2003 36 2019mdbet blue False
2 frc1111 2003 54 2019mdbet blue False
3 frc1111 2003 29 2019mdbet blue False
4 frc1111 2003 42 2019mdbet blue False
... ... ... ... ... ... ...
427 frc7770 2019 42 2019mdbet red False
428 frc7770 2019 41 2019mdbet red False
429 frc7770 2019 83 2019mdbet red False
430 frc7770 2019 45 2019mdbet red False
431 frc7770 2019 54 2019mdbet red False

432 rows × 6 columns
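The nested loops above can also be sketched in a vectorized form with `DataFrame.explode`, which turns each list of team keys into one row per team. This is an alternative illustration on two matches taken from the earlier output, not the code used to build `all_matches`. The helper name `per_team_rows` is hypothetical, and the rookie year and event code columns are left out for brevity.

```python
import pandas as pd

# Two matches per alliance, copied from the blue/red output above
blue = pd.DataFrame({
    "score": [30, 23],
    "team_keys": [["frc2186", "frc2900", "frc1727"],
                  ["frc1446", "frc1885", "frc2537"]],
})
red = pd.DataFrame({
    "score": [47, 62],
    "team_keys": [["frc5841", "frc686", "frc1885"],
                  ["frc3650", "frc686", "frc2912"]],
})
winner_list = ["frc1885", "frc449", "frc2849"]


def per_team_rows(alliance_df, color):
    """Return one row per (match, team) for a single alliance color."""
    rows = alliance_df[["score", "team_keys"]].explode("team_keys")
    rows = rows.rename(columns={"team_keys": "team_name"})
    rows["alliance_color"] = color
    return rows


long_scores = pd.concat(
    [per_team_rows(blue, "blue"), per_team_rows(red, "red")],
    ignore_index=True,
)
# Flag every row belonging to a team that won the event
long_scores["won_event"] = long_scores["team_name"].isin(winner_list)
```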

After compiling all of the match data together, I added three more columns for the 3 additional statistics that the TBA API provides, which required another API call: 1) Offensive Power Rating (OPR): A measure of how many points (on average) an individual team contributes to the overall score of each match (higher is better). 2) Defensive Power Rating (DPR): A measure of how defensive a robot is (lower is better). 3) Calculated Contribution to Winning Margin (CCWM): A measure of how impactful a team is toward helping the alliance win a match (higher is better).
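To make the idea behind OPR concrete: it is commonly described as a least-squares fit in which each alliance score is modeled as the sum of its members' individual contributions. The sketch below works through that idea on made-up data (four hypothetical teams "A" to "D" and four alliance scores). Whether TBA computes its published values in exactly this way is an assumption of this sketch.

```python
import numpy as np

# Hypothetical teams and alliance line-ups, with made-up alliance scores
teams = ["A", "B", "C", "D"]
alliances = [["A", "B", "C"], ["A", "B", "D"], ["A", "C", "D"], ["B", "C", "D"]]
scores = np.array([35.0, 45.0, 30.0, 40.0])

# One row per alliance appearance, one column per team: 1.0 if the team played
membership = np.array(
    [[1.0 if team in alliance else 0.0 for team in teams] for alliance in alliances]
)

# Least-squares solution of membership @ opr ~= scores
opr, *_ = np.linalg.lstsq(membership, scores, rcond=None)
```

Here the fit recovers contributions of 10, 20, 5, and 15 points for teams A to D, which add up to each alliance score exactly. In the values TBA returns for this event, CCWM appears to equal OPR minus DPR (for frc1111, 10.614260 - 14.576206 is approximately -3.961947).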

In [24]:
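# Named stats_request so it does not shadow the scipy `stats` module imported above
stats_request = requests.get("https://www.thebluealliance.com/api/v3/event/2019mdbet/oprs", headers = headers)
json = stats_request.json()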
oprs = pd.DataFrame.from_dict(json)
oprs
Out[24]:
ccwms dprs oprs
frc1111 -3.961947 14.576206 10.614260
frc1123 8.115785 9.208252 17.324036
frc1389 -4.272037 15.681795 11.409758
frc1446 -5.162219 13.027866 7.865647
frc1727 17.734509 8.645342 26.379851
frc1885 6.609230 15.558880 22.168110
frc1915 0.739884 5.308569 6.048453
frc2186 -15.523956 20.251473 4.727517
frc2537 -12.371876 18.952392 6.580516
frc2849 2.596591 8.860362 11.456953
frc2900 -3.647268 8.750373 5.103105
frc2912 11.489318 8.845674 20.334992
frc2914 2.306202 10.556711 12.862913
frc3389 4.084222 7.150827 11.235049
frc3650 -10.128670 17.820775 7.692105
frc3793 14.528858 10.152337 24.681195
frc4456 6.401417 11.303293 17.704710
frc4464 -2.265041 11.673013 9.407972
frc4472 -1.487262 23.822118 22.334856
frc449 1.735295 13.983416 15.718712
frc4514 8.115019 9.552085 17.667104
frc4821 9.727078 8.037392 17.764470
frc4949 -3.317042 7.749086 4.432044
frc5115 -2.545746 14.725956 12.180209
frc5243 -7.922654 17.923467 10.000813
frc5587 7.753885 13.965552 21.719436
frc5830 -1.711540 14.820681 13.109141
frc5841 -9.606052 16.133827 6.527775
frc5945 -4.230201 12.911656 8.681455
frc6213 -0.194984 3.761038 3.566054
frc6239 1.507788 10.640921 12.148709
frc6326 -9.716943 16.102791 6.385848
frc686 1.069198 17.151940 18.221138
frc6893 -4.061566 11.713810 7.652244
frc7714 -8.525617 12.449991 3.924374
frc7770 6.138342 12.646798 18.785140
In [25]:
# Copy the index (team keys) into a column so it can be merged on
oprs['team_name'] = oprs.index
In [26]:
all_matches = pd.merge(all_matches, oprs, on='team_name')
In [27]:
all_matches
Out[27]:
team_name rookie_year score event_code alliance_color won_event ccwms dprs oprs
0 frc1111 2003 29 2019mdbet blue False -3.961947 14.576206 10.61426
1 frc1111 2003 36 2019mdbet blue False -3.961947 14.576206 10.61426
2 frc1111 2003 54 2019mdbet blue False -3.961947 14.576206 10.61426
3 frc1111 2003 29 2019mdbet blue False -3.961947 14.576206 10.61426
4 frc1111 2003 42 2019mdbet blue False -3.961947 14.576206 10.61426
... ... ... ... ... ... ... ... ... ...
427 frc7770 2019 42 2019mdbet red False 6.138342 12.646798 18.78514
428 frc7770 2019 41 2019mdbet red False 6.138342 12.646798 18.78514
429 frc7770 2019 83 2019mdbet red False 6.138342 12.646798 18.78514
430 frc7770 2019 45 2019mdbet red False 6.138342 12.646798 18.78514
431 frc7770 2019 54 2019mdbet red False 6.138342 12.646798 18.78514

432 rows × 9 columns
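Because each team has exactly one row of statistics but many match rows, this is a many-to-one merge. A sketch of the same join with that expectation checked explicitly is shown below, using a few values from the tables above. The `validate` and `right_index` arguments are a choice made for this sketch, not what the cell above uses.

```python
import pandas as pd

# A few match rows and the matching statistics, copied from the outputs above
all_matches = pd.DataFrame({
    "team_name": ["frc1111", "frc1111", "frc7770"],
    "score": [29, 36, 42],
})
oprs = pd.DataFrame(
    {"ccwms": [-3.961947, 6.138342],
     "dprs": [14.576206, 12.646798],
     "oprs": [10.614260, 18.785140]},
    index=["frc1111", "frc7770"],
)

# Join match rows to the per-team statistics stored in the index of `oprs`;
# validate raises an error if a team appears more than once in `oprs`
merged = all_matches.merge(
    oprs, left_on="team_name", right_index=True,
    how="left", validate="many_to_one",
)
```

A left join keeps every match row even if a team had no statistics, so missing values would show up as NaN rather than as silently dropped rows.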

This completes the data curation for the 2019mdbet event. We now have each team that competed in this event, their rookie year, their score for each qualification match played, whether they won the 2019mdbet event, and their OPR, DPR, and CCWM values. I will now repeat this process for all of the other events.

In [28]:
district_events = district_events.drop(district_events.index[1])
district_events
Out[28]:
0
0 2019chcmp
2 2019mdowi
3 2019mdoxo
4 2019vabla
5 2019vagle
6 2019vahay
7 2019vapor
In [29]:
for index, event in district_events.iterrows():
    event_name = event[0]
    
    # Team Curation
    url = "https://www.thebluealliance.com/api/v3/event/" + event_name + "/teams"
    teams_request = requests.get(url, headers = headers)
    json = teams_request.json()
    teams_at_event = pd.DataFrame.from_dict(json)
    teams_at_event = teams_at_event[['key', 'rookie_year']]
    
    # Grabbing match data
    final = []
    url = "https://www.thebluealliance.com/api/v3/event/" + event_name + "/matches"
    match_request = requests.get(url, headers = headers)
    json = match_request.json()
    matches = pd.DataFrame.from_dict(json)
    
    qualifications = matches[~matches['key'].str.contains("_f|_qf|_sf")]
    qualification_scores = qualifications[['alliances']]
    
    # Grabbing winners at each event -- the championship event uses a different award name in the API,
    # so that distinction is made here.
    if event_name == "2019chcmp":
        winner_award = "District Championship Winner"
    else:
        winner_award = "District Event Winner"

    url = "https://www.thebluealliance.com/api/v3/event/" + event_name + "/awards"
    winner_request = requests.get(url, headers = headers)
    json = winner_request.json()
    awards = pd.DataFrame.from_dict(json)
    winners = pd.DataFrame((awards[awards['name'] == winner_award])['recipient_list'].tolist())

    winner_list = []
    # DataFrame.items() replaces iteritems(), which was removed in pandas 2.0
    for index, column in winners.items():
        for awardee, team in enumerate(column):
            winner_list.append(team['team_key'])
    
    # Separating match data by alliance for score pulling
    blue = pd.DataFrame((pd.DataFrame(qualification_scores['alliances'].tolist()))['blue'].tolist())
    red = pd.DataFrame((pd.DataFrame(qualification_scores['alliances'].tolist()))['red'].tolist())
    
    for index1, row1 in teams_at_event.iterrows():
        team = row1['key']
        # Whether this team ultimately won the event (the same value for every match it played)
        won_event = team in winner_list
        for index, row in blue.iterrows():
            if team in row['team_keys']:
                final.append([team, row1['rookie_year'], row['score'], event_name, "blue", won_event])
        for index, row in red.iterrows():
            if team in row['team_keys']:
                final.append([team, row1['rookie_year'], row['score'], event_name, "red", won_event])
    
    compiled = pd.DataFrame (final, columns = ['team_name', "rookie_year", "score", "event_code", "alliance_color", "won_event"])
    
    # Grabbing OPRs and other statistics, and combining with existing data for 2019mdbet
    url = "https://www.thebluealliance.com/api/v3/event/" + event_name + "/oprs"
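    # Named stats_request so it does not shadow the scipy `stats` module imported earlier
    stats_request = requests.get(url, headers = headers)
    json = stats_request.json()
    oprs = pd.DataFrame.from_dict(json)
    
    # Copy the index (team keys) into a column so it can be merged on
    oprs['team_name'] = oprs.index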

    compiled = pd.merge(compiled, oprs, on='team_name')
    
    all_matches = pd.concat([all_matches, compiled])
In [30]:
all_matches = all_matches.reset_index()
all_matches
Out[30]:
index team_name rookie_year score event_code alliance_color won_event ccwms dprs oprs
0 0 frc1111 2003 29 2019mdbet blue False -3.961947 14.576206 10.614260
1 1 frc1111 2003 36 2019mdbet blue False -3.961947 14.576206 10.614260
2 2 frc1111 2003 54 2019mdbet blue False -3.961947 14.576206 10.614260
3 3 frc1111 2003 29 2019mdbet blue False -3.961947 14.576206 10.614260
4 4 frc1111 2003 42 2019mdbet blue False -3.961947 14.576206 10.614260
... ... ... ... ... ... ... ... ... ... ...
3799 439 frc977 2002 53 2019vapor red False 3.722826 21.191992 24.914817
3800 440 frc977 2002 60 2019vapor red False 3.722826 21.191992 24.914817
3801 441 frc977 2002 59 2019vapor red False 3.722826 21.191992 24.914817
3802 442 frc977 2002 59 2019vapor red False 3.722826 21.191992 24.914817
3803 443 frc977 2002 68 2019vapor red False 3.722826 21.191992 24.914817

3804 rows × 10 columns
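The `index` column in the output above is the leftover per-event row number from before the reset. One way to avoid it is to collect the per-event frames in a list and concatenate them once with `ignore_index=True`. In the sketch below, `curate_event` is a hypothetical stand-in for the curation steps in the loop, and the `frc0000` team key is a placeholder.

```python
import pandas as pd


def curate_event(event_key):
    """Hypothetical stand-in for the per-event curation; returns a tiny placeholder frame."""
    return pd.DataFrame({"team_name": ["frc0000"], "event_code": [event_key]})


event_keys = ["2019mdbet", "2019mdowi"]

# Build one frame per event, then concatenate once with a fresh 0..n-1 index
frames = [curate_event(key) for key in event_keys]
all_events = pd.concat(frames, ignore_index=True)
```

Concatenating once at the end also avoids copying the growing DataFrame on every pass through the loop.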

Data Representation

With the data we've collected so far, let's take a look at what scores at each event looked like:

In [31]:
plt.rcParams["figure.figsize"] = (10,10)
In [32]:
for event in all_matches['event_code'].unique():
    temp = all_matches.loc[all_matches['event_code'] == event]

    for team in temp['team_name'].unique():
        temp2 = temp[temp['team_name'] == team]    
        match_counter = 1

        matches = []
        score = []
        for index, match in temp2.iterrows():
            matches.append(match_counter)
            score.append(match['score'])
            match_counter = match_counter + 1

        plt.plot(matches, score, label = team)
        
    plt.title("Qualification Scores for All Teams at " + event)
    plt.xlabel("Match Played")
    plt.ylabel("Score from Match")
    plt.show()

From the jumble of lines for each event, there is very little we can tell. For most events, the score in a typical qualification match appears to fall between 40 and 60 points, and many teams show sharp swings from one match to the next. At the district championship (2019chcmp), 2 teams reached season-high scores of 110+, which fits the caliber of play expected at that level. To get a more meaningful breakdown of this data, we should filter on one feature. The natural choice is the winning teams at each event, since that is a much smaller subset to analyze.
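A per-event numeric summary is one way to check a visual reading like "40-60 points" without relying on the plot. The sketch below uses only the standard library and a handful of made-up rows; the team labels and scores are hypothetical and are not taken from the collected data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows shaped like all_matches: (event_code, team_name, score).
# These values are invented for illustration only.
rows = [
    ("2019mdbet", "teamA", 29),
    ("2019mdbet", "teamA", 54),
    ("2019mdbet", "teamB", 61),
    ("2019chcmp", "teamC", 110),
    ("2019chcmp", "teamC", 48),
]

def summarize_by_event(rows):
    """Return {event_code: (mean, min, max)} of the match scores at each event."""
    scores = defaultdict(list)
    for event_code, _team, score in rows:
        scores[event_code].append(score)
    return {event: (mean(s), min(s), max(s)) for event, s in scores.items()}

summary = summarize_by_event(rows)
for event, (avg, low, high) in summary.items():
    print(event, avg, low, high)
```

On the real DataFrame, the same summary could be obtained with `all_matches.groupby('event_code')['score'].agg(['mean', 'min', 'max'])`.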

Exploratory Analysis and Data Visualization¶

From the curated data, one major aspect I'm interested to see is how the age of a team influences the chance of winning at an event. Age of a team can be explored through win rate, as well as how the additional statistics vary among the winning teams. But first:

Analyzing General Score Trends for Winning Teams per Event¶

It might be instinct to assume that the 3-4 teams that win each event consistently post high scores throughout qualifications, but this is not necessarily true. During alliance selection, the Top 8 teams choose their first alliance member in order from Rank 1 to Rank 8, and then the second alliance member is chosen in reverse order from Rank 8 to Rank 1. There are also underlying factors that the collected data may not capture: some teams have strong friendships with each other, and a team may simply have been unlucky with its match schedule and paired with underperforming robots, driving its ranking and statistics down when in reality it might be a strong sleeper pick.
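The pick order described above (Rank 1 to Rank 8, then Rank 8 back to Rank 1) can be sketched in a few lines. This is a simplified illustration: the function name is hypothetical, and it ignores details such as a captain being picked by a higher-ranked alliance.

```python
def serpentine_pick_order(num_alliances=8):
    """Return the captain ranks in the order they pick across both rounds."""
    first_round = list(range(1, num_alliances + 1))   # Rank 1 -> Rank 8
    second_round = list(reversed(first_round))        # Rank 8 -> Rank 1
    return first_round + second_round

pick_order = serpentine_pick_order()
print(pick_order)
```

The reversal means the Rank 8 captain makes two picks back to back, while the Rank 1 captain picks first and last.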

Regardless, I think it's important to look at the score data for the different winners and analyze what's there.

In [33]:
for event in all_matches['event_code'].unique():
    temp = all_matches.loc[all_matches['event_code'] == event]
    
    # Pulling out teams that won at the event
    temp2 = temp[temp['won_event'] == True]

    for team in temp2['team_name'].unique():
        temp3 = temp[temp['team_name'] == team]    
        match_counter = 1

        matches = []
        score = []
        
        # Plotting each match the team played and the score for that match
        for index, match in temp3.iterrows():
            matches.append(match_counter)
            score.append(match['score'])
            match_counter = match_counter + 1

        plt.plot(matches, score, label = team)

    plt.legend(loc="upper left")
    plt.title("Qualification Scores for the Winning Teams at " + event)
    plt.xlabel("Match Played")
    plt.ylabel("Score from Match")
    plt.show()

From the above graphs, I think it's safe to say that the scores alone do not tell us much, much like the graphs from the previous step. Scores vary a great deal across all matches at all events. One interesting thing to note is that by the time teams reach the district championship, events have already been happening for 5-6 weeks. Most teams have already played at 2 events, and some at 3 if they paid to play an extra week. Not everyone qualifies for the district championship, so you would logically expect the caliber of play there to be higher. That makes the 2019chcmp data surprising: frc4541 has two drastic drops in their qualification scores, yet they were still able to move on to playoffs and win at the district championship level.

Analyzing Team Age vs Win Rate¶

When collecting the teams that won at each event, I made sure to avoid duplicates while looping through. Likewise, I counted each team only once, even if it won at multiple events (which is a possibility). Therefore, while we'd expect 24-32 winners in total (8 events with 3 winners each at minimum, or 4 winners each at maximum), it's possible for there to be fewer than 24 unique teams in this population.

In [34]:
winning_years = []
winning_teams_unique = []
winning_teams = []

for event in all_matches['event_code'].unique():
    temp = all_matches.loc[all_matches['event_code'] == event]
    
    temp2 = temp.drop_duplicates(subset = 'team_name')

    for index, row in temp2.iterrows():
        if row['won_event'] and row['team_name'] not in winning_teams:
            winning_years.append(row['rookie_year'])
            winning_teams_unique.append(row['team_name'])
        if row['won_event']:
            winning_teams.append(row['team_name'])
In [35]:
# Curating the amount of teams that won per Rookie Year
winning_years = pd.value_counts(winning_years)
winning_years = winning_years.to_frame()
winning_teams = pd.value_counts(winning_teams)

for index, row in winning_years.iterrows():
    winning_years.at[index, 'year'] = index
In [36]:
winning_years['year'] = winning_years['year'].astype(int)
winning_years = winning_years.rename(columns = {0 :'value'})
winning_years = winning_years.sort_values(by=['year'])
In [37]:
winning_years
Out[37]:
value year
2000 3 2000
2001 4 2001
2002 1 2002
2004 2 2004
2005 2 2005
2006 2 2006
2008 1 2008
2009 1 2009
2010 1 2010
2011 1 2011
2013 1 2013
2017 1 2017
2018 1 2018
In [38]:
winning_teams
Out[38]:
frc612     2
frc619     2
frc6882    2
frc346     2
frc6543    1
frc1599    1
frc2363    1
frc1731    1
frc539     1
frc1262    1
frc3274    1
frc1885    1
frc2849    1
frc1418    1
frc836     1
frc614     1
frc3748    1
frc4541    1
frc401     1
frc449     1
frc1610    1
dtype: int64

From the above, we can see that there were only 21 unique winning teams, with 4 teams winning at 2 events each. Therefore, it's important that we remove these duplicates in our analysis.
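The counts in this paragraph can be rechecked directly from the `winning_teams` output. The sketch below copies those values into a `Counter` (the variable names are my own) and derives the number of unique winners, the total number of team-event wins, and the repeat winners.

```python
from collections import Counter

# Event wins per team, copied from the winning_teams output above.
wins_per_team = Counter({
    "frc612": 2, "frc619": 2, "frc6882": 2, "frc346": 2,
    "frc6543": 1, "frc1599": 1, "frc2363": 1, "frc1731": 1,
    "frc539": 1, "frc1262": 1, "frc3274": 1, "frc1885": 1,
    "frc2849": 1, "frc1418": 1, "frc836": 1, "frc614": 1,
    "frc3748": 1, "frc4541": 1, "frc401": 1, "frc449": 1,
    "frc1610": 1,
})

unique_winners = len(wins_per_team)              # distinct teams
total_event_wins = sum(wins_per_team.values())   # one per (team, event) win
repeat_winners = sorted(t for t, wins in wins_per_team.items() if wins > 1)

print(unique_winners, total_event_wins, repeat_winners)
```

The total of team-event wins falls inside the 24-32 range expected for 8 events, while the number of unique teams falls below it because of the repeat winners.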

In [39]:
winning_years.plot.bar(x='year', y='value', rot=0, title = "Number of Winning Teams per Rookie Year")
Out[39]:
<AxesSubplot:title={'center':'Number of Winning Teams per Rookie Year'}, xlabel='year'>

From this, we can see that 7/21, or 1/3, of the teams that won events in 2019 had a rookie year of 2000 or 2001. Furthermore, two-thirds of the winning teams (14/21) had a rookie year of 2006 or earlier, at least 13 years before the 2019 season. Looking at the 5 years prior to the 2019 season, only 2 teams with rookie years in that time frame won an event in 2019. This could be an early indicator that the older a team is, the more likely it is to win an event.
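The fractions in this paragraph can be recomputed from the `winning_years` table. The sketch below copies the table into a dictionary; the helper name `winners_between` is my own.

```python
# Winning teams per rookie year, copied from the winning_years table above.
winners_by_rookie_year = {
    2000: 3, 2001: 4, 2002: 1, 2004: 2, 2005: 2, 2006: 2, 2008: 1,
    2009: 1, 2010: 1, 2011: 1, 2013: 1, 2017: 1, 2018: 1,
}

def winners_between(first_year, last_year):
    """Count winning teams whose rookie year is in [first_year, last_year]."""
    return sum(count for year, count in winners_by_rookie_year.items()
               if first_year <= year <= last_year)

total = sum(winners_by_rookie_year.values())
print(winners_between(2000, 2001), "of", total)   # rookie year 2000 or 2001
print(winners_between(2000, 2006), "of", total)   # at least 13 years before 2019
print(winners_between(2014, 2018), "of", total)   # the 5 years before 2019
```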

Analyzing Team Age vs the different provided statistics¶

In [40]:
opr = []
dpr = []
ccwm = []
years = []
winning_teams = []

for event in all_matches['event_code'].unique():
    temp = all_matches.loc[all_matches['event_code'] == event]
    
    temp2 = temp.drop_duplicates(subset = 'team_name')

    for index, row in temp2.iterrows():
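        # winning_teams is reset above and never appended to in this cell, so this check
        # always passes: a team that won two events contributes one row per event.
        if row['won_event'] and row['team_name'] not in winning_teams: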
            years.append(row['rookie_year'])
            opr.append(row['oprs'])
            dpr.append(row['dprs'])
            ccwm.append(row['ccwms'])
In [41]:
zipped = list(zip(years, opr, dpr, ccwm))
statistics = pd.DataFrame(zipped, columns = ['Year', 'OPR', 'DPR', 'CCWM'])
In [42]:
seaborn.scatterplot(x="Year",
                    y="OPR",
                    data=statistics).set(title = "OPR vs Rookie Year of Winning Teams in 2019")
Out[42]:
[Text(0.5, 1.0, 'OPR vs Rookie Year of Winning Teams in 2019')]

Looking at OPR vs Rookie Year, we can see a general downward trend in OPR for newer teams, indicating that older teams are able to provide more points per match during qualifications, and logically, into playoff matches as well. We can see how drastic this trend is if we fit a line to our scatterplot:

In [45]:
slope, intercept, r_value, pv, se = stats.linregress(statistics['Year'],statistics['OPR'])
In [46]:
print(slope)
-0.9723503931576458
In [47]:
print(intercept)
1972.2223485557051
In [48]:
r_value
Out[48]:
-0.7043580944090574

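From the above regression, we can see that OPR vs Year yields a regression equation of y = -0.972x + 1972.222, where x is the rookie year of the team and y is the predicted OPR. The regression has an r value of -0.704 (r² ≈ 0.496), which indicates a fairly strong negative linear relationship: rookie year accounts for roughly half of the variation in OPR among the winning teams. The r value measures the strength of the relationship; its statistical significance is given by the p-value that linregress also returns (stored in pv above).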
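To make the regression line concrete, the sketch below plugs two rookie years into it, using the slope, intercept, and r value exactly as printed above. The function name `predicted_opr` is my own.

```python
# Values printed by stats.linregress for OPR vs rookie year (see the cells above).
OPR_SLOPE = -0.9723503931576458
OPR_INTERCEPT = 1972.2223485557051
OPR_R = -0.7043580944090574

def predicted_opr(rookie_year):
    """OPR predicted by the fitted line for a winning team with this rookie year."""
    return OPR_SLOPE * rookie_year + OPR_INTERCEPT

opr_2000 = predicted_opr(2000)   # roughly 27.5
opr_2018 = predicted_opr(2018)   # roughly 10.0
r_squared = OPR_R ** 2           # roughly 0.496, the share of variance explained

print(round(opr_2000, 2), round(opr_2018, 2), round(r_squared, 3))
```

In other words, the line predicts that each additional year of "youth" costs a winning team a little under one OPR point.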

In [49]:
seaborn.scatterplot(x="Year",
                    y="DPR",
                    data=statistics).set(title = "DPR vs Rookie Year of Winning Teams in 2019")
Out[49]:
[Text(0.5, 1.0, 'DPR vs Rookie Year of Winning Teams in 2019')]

Compared to the OPR vs Year scatter plot, this plot has a more randomized spread, and does not really speak to whether there's any significant relationship between DPR and Rookie Year of winning teams.

In [50]:
seaborn.scatterplot(x="Year",
                    y="CCWM",
                    data=statistics).set(title = "CCWM vs Rookie Year of Winning Teams in 2019")
Out[50]:
[Text(0.5, 1.0, 'CCWM vs Rookie Year of Winning Teams in 2019')]

Similar to the OPR vs Rookie Year plot from above, we can see a general downward trend in CCWM for newer teams, indicating that older teams are able to be more impactful to their alliance during qualifications, and logically, into playoff matches as well. We can see how drastic this trend is if we fit a line to our scatterplot:

In [51]:
slope, intercept, r_value, pv, se = stats.linregress(statistics['Year'],statistics['CCWM'])
In [52]:
print(slope)
-0.8721593643810367
In [53]:
print(intercept)
1755.6727851186945
In [54]:
r_value
Out[54]:
-0.6213857140427312

Thus from above, we are presented with the regression equation of y = -0.872x + 1755.673, where x is the Rookie Year of the team and y is the predicted CCWM. This line has an r value of -0.621 (r² ≈ 0.386), a weaker linear relationship than the one between OPR and Rookie Year, so Rookie Year explains less of the variation in CCWM.
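An r value alone does not say whether a relationship is statistically significant; that also depends on the number of points. As a rough check, the sketch below uses the Fisher z-transformation, a standard normal approximation for testing whether a correlation differs from zero. Two things are assumptions here: the sample size of 21 (the number of unique winning teams; the scatterplot may contain more points), and the approximation itself, which is not the test linregress uses. The `pv` value returned by linregress should be preferred where available.

```python
import math
from statistics import NormalDist

def fisher_z_p_value(r, n):
    """Approximate two-sided p-value for H0: correlation = 0 (Fisher z-transform)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

N_WINNERS = 21  # assumption: one point per unique winning team

p_opr = fisher_z_p_value(-0.7043580944090574, N_WINNERS)
p_ccwm = fisher_z_p_value(-0.6213857140427312, N_WINNERS)

print(p_opr, p_ccwm)
```

Under these assumptions both approximate p-values fall below 0.05, with the OPR relationship being the stronger of the two.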

Because OPR showed the stronger relationship with Rookie Year, I decided to create a model to determine how accurate OPR is in predicting the winner of an event. This task leads us to the next section:

Hypothesis Testing/Machine Learning¶

Before starting, it's important to state what my different hypotheses are for this model:

Null Hypothesis: The Rookie Year, Match Score, and OPR are not good predictors of whether a team will win at an event.

Alternative Hypothesis: The Rookie Year, Match Score, and OPR are good predictors of whether a team will win at an event.

While we saw earlier that there was a relationship between OPR and the Rookie Year of the winning teams, we now want to see how well these two factors, together with the per-match score, predict winners.

In [55]:
ind_data = all_matches[['rookie_year', 'score', 'oprs']]
In [56]:
ind_data
Out[56]:
rookie_year score oprs
0 2003 29 10.614260
1 2003 36 10.614260
2 2003 54 10.614260
3 2003 29 10.614260
4 2003 42 10.614260
... ... ... ...
3799 2002 53 24.914817
3800 2002 60 24.914817
3801 2002 59 24.914817
3802 2002 59 24.914817
3803 2002 68 24.914817

3804 rows × 3 columns

In [57]:
dep_data = all_matches['won_event']
In [58]:
dep_data
Out[58]:
0       False
1       False
2       False
3       False
4       False
        ...  
3799    False
3800    False
3801    False
3802    False
3803    False
Name: won_event, Length: 3804, dtype: bool

Using holdout validation, I split the collected data into training and test sets. Because this is a classification problem, I decided to try two different model types: Decision Trees and Linear Discriminant Analysis. After running both models, I will compare their accuracy scores to determine how well each predicts winners from the three provided features.
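For readers new to holdout validation, the sketch below shows the idea using only the standard library: shuffle the row indices once, set a fraction aside for testing, and train on the rest. The function `holdout_split` is a hypothetical stand-in, not scikit-learn's implementation, so it will not reproduce the exact rows that `train_test_split` selects. The 25% test fraction matches `train_test_split`'s default when no size is given.

```python
import math
import random

def holdout_split(rows, test_fraction=0.25, seed=13):
    """Shuffle once, then hold out a fixed fraction of the rows for testing."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = math.ceil(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test

# Stand-in for the 3804 rows of all_matches.
rows = list(range(3804))
train, test = holdout_split(rows)
print(len(train), len(test))
```

With 3804 rows, a 25% holdout leaves 2853 rows for training and 951 for testing.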

In [59]:
ind_train, ind_test, dep_train, dep_test = ms.train_test_split(ind_data, dep_data, random_state=13)
In [60]:
decision_tree = DecisionTreeClassifier()
decision_tree = decision_tree.fit(ind_train, dep_train)
dt_predicted = decision_tree.predict(ind_test)
In [61]:
# Accuracy Score for Decision Tree Model
met.accuracy_score(dep_test, dt_predicted)
Out[61]:
0.9989484752891693
In [62]:
lda = LinearDiscriminantAnalysis()
lda = lda.fit(ind_train, dep_train)
lda_predicted = lda.predict(ind_test)
In [63]:
# Accuracy Score for the Linear Discriminant Analysis
met.accuracy_score(dep_test, lda_predicted)
Out[63]:
0.9263932702418507

From the accuracy scores above, both models predicted whether a team won its event from qualification scores, rookie year, and OPR with more than 92% accuracy on the test data. Thus, in terms of my hypothesis, I would conclude that the accuracy scores support the statement that the Rookie Year, Match Score, and OPR are a good measure to determine whether a team will win at an event. These scores should be read with the class balance in mind: only 3-4 teams win each event, so most rows are non-winners and a model can score well largely by predicting that a team did not win.
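Two quick checks help put these accuracy scores in context. The sketch below first converts each score back into a count of correct predictions, assuming the test set holds 951 rows (25% of 3804, the default split size). It then defines a majority-class baseline, meaning the accuracy of always predicting the most common label. The baseline is demonstrated on made-up labels because the true split of winners in the test set is not shown above.

```python
from collections import Counter

TEST_ROWS = 951  # assumption: train_test_split's default 25% of 3804 rows

def correct_predictions(accuracy, n_rows=TEST_ROWS):
    """Convert an accuracy score back into a count of correct predictions."""
    return round(accuracy * n_rows)

dt_correct = correct_predictions(0.9989484752891693)
lda_correct = correct_predictions(0.9263932702418507)

def majority_baseline(labels):
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical labels: 1 winner row for every 9 non-winner rows.
toy_labels = [False] * 9 + [True]
toy_baseline = majority_baseline(toy_labels)

print(dt_correct, lda_correct, toy_baseline)
```

Calling `majority_baseline(dep_test)` on the real test labels would give the figure each model's accuracy needs to beat.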

There is one important caveat/limitation to note here: the OPR from the TBA API is a single value per team for the whole event, not a per-match value. Further analysis and data collection would be needed (if a per-match OPR even exists, or else by calculating it manually) to determine how an OPR that changes after each match would impact both models and their predictive capabilities.
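For a sense of what calculating OPR manually could involve: OPR is commonly described as the least-squares solution to "alliance score = sum of its members' contributions". The sketch below solves the corresponding normal equations in pure Python on a made-up schedule of 2-team alliances. The helper names, teams, and scores are hypothetical, and TBA's own calculation may differ in its details. A per-match OPR could then be obtained by re-running the estimate on only the matches played so far.

```python
def solve_linear_system(matrix, vector):
    """Solve matrix @ x = vector by Gauss-Jordan elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [value] for row, value in zip(matrix, vector)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def estimate_opr(alliances, scores, teams):
    """Least-squares contribution per team via the normal equations A^T A x = A^T s."""
    index = {team: i for i, team in enumerate(teams)}
    n = len(teams)
    ata = [[0.0] * n for _ in range(n)]   # how often each pair of teams played together
    ats = [0.0] * n                       # total alliance score seen by each team
    for members, score in zip(alliances, scores):
        for a in members:
            ats[index[a]] += score
            for b in members:
                ata[index[a]][index[b]] += 1.0
    return dict(zip(teams, solve_linear_system(ata, ats)))

# Hypothetical schedule in which each alliance score is the exact sum of hidden
# contributions A=10, B=20, C=30, D=40.
teams = ["A", "B", "C", "D"]
alliances = [("A", "B"), ("C", "D"), ("A", "C"), ("B", "D"), ("A", "D"), ("B", "C")]
scores = [30, 70, 40, 60, 50, 50]

opr = estimate_opr(alliances, scores, teams)
print(opr)
```

Because the toy scores are exact sums, the estimate recovers the hidden contributions; with real match data the fit would only be approximate.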

Conclusion¶

Let's recap what we've seen so far:

1) We started by gathering our relevant match data for the 2019 season from the TBA API.
2) We then graphed score data for each event -- which really didn't provide anything of significance to us. To further understand what information the data held, we broke it down further in the exploratory data analysis.
3) Through exploratory data analysis, we analyzed possible trends in winning teams, and reached a conclusion about a relationship between the Rookie Year of a winning team and their OPR.
4) Using this initial conclusion, we trained two models to take in those two features, as well as scores from each qualification match, to create models that would predict whether a team would win the event based on those three features.

What does this mean, what comes next?¶

We now have a model that works on data from the FIRST Chesapeake District. One extension would be to include other districts (e.g. FIRST in Texas, FIRST Mid-Atlantic, FIRST in Michigan, etc.) and see how accurate the model is on a larger set of data. The same approach could also be applied to different seasons to see how accurately it predicts winners across seasons. This model could also be adapted for a regional system, but its predictive power is stronger within a district system. Since teams play multiple events in a district, the model could also be adapted to see how a change in OPR between events impacts a team's chance of winning at their second event.

In general, there's also one thing to note about the FRC data used here, which applies equally to future iterations/extensions of this model. Within the FRC community, there are many teams that are commonly known to perform well and consistently win district events/regionals, and some even at the championship level, year after year. Some examples of these teams include FRC254, FRC1678, and FRC1114. An FRC season can be a costly affair -- teams have budgets ranging from a few thousand dollars to amounts well into the six-figure range. It's important to note that money does play a factor in how well teams perform -- from what equipment and tools they are able to purchase, to other expenses. Similarly, other resources such as mentor support, building conditions, and the student makeup of each team play a strong role in how well teams perform. This may not be entirely captured by the data presented here.